Abstract: Electron beam lithography (EBL) is a promising maskless
solution for technology nodes beyond the 14nm logic node. To overcome its
throughput limitation, industry has proposed the character projection (CP)
technique, where some complex shapes (characters) can be printed
in one shot. Recently, the traditional EBL system has been extended into the
multi-column cell (MCC) system to further improve throughput. In an
MCC system, several independent CPs are used to speed up
the writing process. Because of the area constraint of the stencil, the
stencil of an MCC system needs to be packed/planned carefully to take
advantage of the characters. In this paper, we prove that the overlapping aware
stencil planning (OSP) problem is NP-hard. To solve the OSP problem in
MCC systems, we present a tool, E-BLOW, with several novel speedup
techniques, such as successive relaxation, dynamic programming, and
KD-Tree based clustering. Experimental results show that, compared
with previous works, E-BLOW demonstrates better performance for both
the conventional EBL system and the MCC system.

Keywords: Electron Beam Lithography, Overlapping aware Stencil
Planning, Multi-Column Cell System

1 Introduction 

As the minimum feature size continues to scale to sub-22nm,
the conventional 193nm optical photolithography
technology is reaching its printability limit. In the near
future, multiple patterning lithography (MPL) has become
one of the viable lithography techniques for 22nm
and 14nm logic nodes. In the longer future, i.e., for
the logic nodes beyond 14nm, extreme ultraviolet (EUV),
directed self-assembly (DSA), and electron beam lithography
(EBL) are promising candidates as next generation
lithography technologies. Currently, both EUV and
DSA suffer from some technical barriers. EUV
is delayed due to tremendous technical issues such as
lack of power sources, resists, and defect-free masks.
DSA has only the potential to generate contact or via
layers.

The preliminary version has been presented at IEEE/ACM Design Automation 
Conference (DAC) in 2013. 

B. Yu and D. Z. Pan are with the Department of Electrical and Computer 
Engineering, University of Texas, Austin, TX 78731 USA. 

K. Yuan was with the Department of Electrical and Computer Engineering, 
University of Texas, Austin, TX 78731 USA. He is now with Facebook Inc., 
Menlo Park, CA 94025 USA. 

J-R. Gao was with the Department of Electrical and Computer Engineering, 
University of Texas, Austin, TX 78731 USA. She is now with Cadence Design 
Systems, Austin, TX 78752 USA. 


The EBL system, on the other hand, has been developed
for several decades. Compared with traditional
lithographic methodologies, EBL has several advantages.
(1) The electron beam can be easily focused to a nanometer-scale
diameter with a charged particle beam, which avoids
the diffraction limitation of light. (2)
The price of a photomask set is becoming unaffordable,
especially with the emerging MPL techniques. As a
maskless technology, EBL can reduce the manufacturing
cost. (3) EBL allows great flexibility for fast turnaround
times and even late design modifications to correct or
adapt a given chip layout. Because of all these advantages,
EBL is being used in mask making, small volume
LSI production, and R&D to develop the technological
nodes ahead of mass production.

A conventional EBL system applies the variable shaped
beam (VSB) technique. In this mode, the entire layout is
decomposed into a set of rectangles, each being shot into
the resist by one electron beam shot. In the printing process of
VSB mode, the electron gun first generates an initial
beam, which becomes uniform through the shaping aperture.
Then the second aperture finalizes the target shape
with a limited maximum size. Since each pattern needs
to be fractured into rectangles and printed one
by one, the VSB mode suffers from a serious throughput
problem.

One improved technique is called character projection
(CP), where the second aperture is replaced by a stencil.
Some complex shapes, called characters, are prepared
on the stencil. The key idea is that if a pattern is pre-designed
on the stencil, it can be printed in one electron
shot; otherwise it needs to be fractured into a set of
rectangles and printed one by one through VSB mode.
In this way the CP mode can improve the throughput
significantly. In addition, CP exposure has good CD
control stability compared with VSB. However, the
area constraint of the stencil is the bottleneck. For modern
designs, due to the numerous distinct circuit patterns,
only a limited number of patterns can be employed on
the stencil. Those patterns not contained in the stencil are still
required to be written by VSB. Thus one emerging challenge
in CP mode is how to pack the characters into
the stencil to effectively improve the throughput.

Fig. 1. Printing process of MCC system, where four CPs
are bundled.

Even with decades of development, the key limitation
of the EBL system has been and still is the low
throughput. Recently, the multi-column cell (MCC) system
was proposed as an extension to the CP technique [11], [12]. In
an MCC system, several independent character projections
(CPs) are used to further speed up the writing process.
Each CP is applied on one section of the wafer, and all
CPs can work in parallel to achieve better throughput. In
modern MCC systems, there are more than 1300 character
projections (CPs) [13]. Since one CP is associated with
one stencil, there are more than 1300 stencils in total.
The manufacturing of a stencil is similar to mask
manufacturing. If each stencil were different, the stencil
preparation process would be very time consuming and
expensive. Due to design complexity and cost
considerations, different CPs share one stencil design. One
example of the MCC printing process is illustrated in Fig. 1,
where four CPs are bundled to generate an MCC system.
In this example, the whole wafer is divided into four
regions, $w_1$, $w_2$, $w_3$ and $w_4$, and each region is printed
through one CP. Note that the whole writing time of
the MCC system is determined by the maximum writing time
among the four regions. For modern designs, because of the
numerous distinct circuit patterns, only a limited number
of patterns can be employed on the stencil. Since the area
constraint of the stencil is the bottleneck, the stencil should
be carefully designed/manufactured to contain the most
repeated cells or patterns.

Many previous works dealt with design optimization
for the EBL system. [14], [15] considered EBL as a
complementary lithography technique to print via/cut
patterns. [16], [17] solved the subfield scheduling problem
to reduce the critical dimension distortion. [18]-[20]
proposed a set of layout/mask fracturing approaches
to reduce the VSB shot number. Besides, several works
solved the design challenges under the CP technique. [21],
[22] proposed several character design methods for both
via layers and interconnect layers to achieve stencil
area-efficiency.

Fig. 2. Two types of OSP problem. (a) 1DOSP. (b) 2DOSP.

As one of the greatest challenges in CP mode, stencil
planning has attracted much attention [23]-[25]. When
blank overlapping is not considered, stencil planning
is equivalent to a character selection problem. [23] proposed
an integer linear programming (ILP) formulation to select
a group of characters for throughput maximization.
When the characters can be overlapped to save
more stencil space, the corresponding stencil planning
is referred to as overlapping-aware stencil planning (OSP).
Prior work investigated the OSP problem to place more
characters onto the stencil. Recently, [26], [27] assumed that
the pattern position in each character can be shifted, and
integrated the character re-design into the OSP problem. As
suggested in [24], the OSP problem can be divided into
two types: 1DOSP and 2DOSP. In 1DOSP, standard
cells with the same height are selected into the stencil. As shown
in Fig. 2(a), each character implements one standard cell,
and the enclosed circuit patterns of all the characters
have the same height. Note that here we only show
the horizontal blanks; the vertical blanks are not
represented because they are identical. In 2DOSP, the
blank spaces of characters are non-uniform along both
horizontal and vertical directions. In this way, the stencil
can contain both complex via patterns and regular wires.
Fig. 2(b) illustrates a stencil design example for 2DOSP.

Compared with the conventional EBL system, the MCC system
introduces two main challenges in the OSP problem.
First, the objective is new: in an MCC system the wafer is
divided into several regions, and each region is written
by one CP. Therefore the new OSP should minimize the
maximal writing time over all regions, whereas in the
conventional EBL system the objective is simply to minimize
the wafer writing time. Second, the stencil for an MCC
system can contain more than 4000 characters, so previous
methodologies for the EBL system may suffer from a runtime
penalty. However, no existing stencil planning work has
targeted the MCC system.

This paper presents E-BLOW, a comprehensive study
of the 1DOSP and 2DOSP problems for the MCC system. Our
main contributions are summarized as follows.

• We provide the proof that both 1DOSP and 2DOSP
problems are NP-hard.

• We formulate an integer linear program (ILP)
to co-optimize character selection and physical
placement on the stencil. To the best of our knowledge,
this is the first mathematical formulation for both
1DOSP and 2DOSP.

• We propose a simplified formulation for 1DOSP.

• We present a successive relaxation algorithm to find
a near optimal solution.

• We design a KD-Tree based clustering algorithm to
speed up the 2DOSP solution.

The rest of this paper is organized as follows. Section
2 provides the problem formulation. Section 3 presents
algorithmic details to resolve the 1DOSP problem in E-BLOW,
while Section 4 details the E-BLOW solutions to the 2DOSP
problem. Section 5 reports experimental results, followed
by the conclusion in Section 6.

2 Preliminaries 

In this section, we provide the preliminaries regarding
overlapping aware stencil planning (OSP). During character
design, blank area is usually reserved around the character
boundaries. Note that in this paper, the blank space refers
to the blank around the character boundaries. The term
"overlapping" means sharing blanks between adjacent
characters. In this way, more characters can be placed
on the stencil [24]. In this section, we first provide
the detailed problem formulation, and then prove
that both 1DOSP and 2DOSP are NP-hard.

2.1 Problem Formulation 

In an MCC system with $P$ CPs, the whole wafer is
divided into $P$ regions $\{r_1, r_2, \ldots, r_P\}$, and each region
is written by one particular CP. We assume cell extraction
has been resolved first. In other words, a set of
character candidates $\{c_1, \ldots, c_n\}$ has already been given
to the MCC system. For each character candidate $c_i$, its
writing time through VSB mode is denoted as $n_i$, while
its writing time through CP mode is 1.

The regions of the wafer have different layout patterns,
and their throughputs would also be different. Suppose
character candidate $c_i$ repeats $t_{ic}$ times on region $r_c$.
Let $a_i$ indicate the selection of character candidate $c_i$ as
follows.

$$a_i = \begin{cases} 1, & \text{candidate } c_i \text{ is selected on stencil} \\ 0, & \text{otherwise} \end{cases}$$

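If $c_i$ is prepared on the stencil, the total writing time of
pattern $c_i$ on region $r_c$ is $t_{ic} \cdot 1$. Otherwise, $c_i$ should
be printed through VSB. Since region $r_c$ contains $t_{ic}$ copies of
candidate $c_i$, the writing time would be $t_{ic} \cdot n_i$. Therefore,
for region $r_c$ the total writing time $T_c$ is as follows:

$$\begin{aligned}
T_c &= \sum_{i=1}^{n} a_i \cdot (t_{ic} \cdot 1) + \sum_{i=1}^{n} (1 - a_i) \cdot (t_{ic} \cdot n_i) \\
    &= \sum_{i=1}^{n} t_{ic} \cdot n_i - \sum_{i=1}^{n} t_{ic} \cdot (n_i - 1) \cdot a_i \\
    &= T_c^{VSB} - \sum_{i=1}^{n} R_{ic} \cdot a_i
\end{aligned}$$

where we denote $T_c^{VSB} = \sum_{i=1}^{n} t_{ic} \cdot n_i$ and $R_{ic} = t_{ic} \cdot (n_i - 1)$.
$T_c^{VSB}$ is the writing time on $r_c$ when only VSB
is applied, and $R_{ic}$ represents the writing time reduction
of candidate $c_i$ on region $r_c$. In an MCC system, for each
region $r_c$ both $T_c^{VSB}$ and $R_{ic}$ are constants. Therefore,
the total writing time of the MCC system is formulated
as follows:

$$T_{total} = \max\{T_c\} = \max\Big\{T_c^{VSB} - \sum_{i=1}^{n} R_{ic} \cdot a_i\Big\}, \quad \forall c \in P \tag{1}$$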
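The writing-time model can be checked numerically with a small sketch. The function names and the sample data below are illustrative assumptions, not part of E-BLOW; the code only evaluates $T_c$ in its direct form and in its "VSB time minus reduction" form, and takes the maximum over regions for $T_{total}$.

```python
from typing import Sequence


def region_writing_time(t_c: Sequence[int], n: Sequence[int], a: Sequence[int]) -> int:
    """T_c as a direct sum: one CP shot if selected, n_i VSB shots otherwise."""
    return sum(t_ic * (1 if a_i else n_i) for t_ic, n_i, a_i in zip(t_c, n, a))


def region_writing_time_reduced(t_c: Sequence[int], n: Sequence[int], a: Sequence[int]) -> int:
    """Same T_c written as T_c^VSB minus the reductions R_ic of selected candidates."""
    t_vsb = sum(t_ic * n_i for t_ic, n_i in zip(t_c, n))
    reduction = sum(t_ic * (n_i - 1) * a_i for t_ic, n_i, a_i in zip(t_c, n, a))
    return t_vsb - reduction


def system_writing_time(t: Sequence[Sequence[int]], n: Sequence[int], a: Sequence[int]) -> int:
    """T_total: the slowest region determines the MCC writing time."""
    return max(region_writing_time(t_c, n, a) for t_c in t)


# Hypothetical data: 3 candidates, 2 regions; t[c][i] is the repeat count t_ic.
n = [5, 3, 8]                  # VSB shot counts n_i
t = [[10, 0, 2], [1, 20, 1]]   # repeats per region
a = [1, 0, 1]                  # candidates 1 and 3 are on the stencil
print(system_writing_time(t, n, a))  # 62
```

In this toy instance the second region dominates because its most frequent candidate is not on the stencil, which is exactly the min-max effect the MCC objective has to handle.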

Based on the notations above, we define overlapping
aware stencil planning (OSP) for the MCC system as
follows.

Problem 1. OSP for MCC System: Given a set of character
candidates, select a subset of them as characters,
and place them on the stencil. The objective is to minimize the
system writing time $T_{total}$ expressed by Eqn. (1), while the
placement of the selected characters is bounded by the outline of
the stencil. The width and height of the stencil are $W$ and $H$, respectively.

For convenience, we use the term OSP to refer to OSP
for the MCC system in the rest of this paper.

2.2 NP-Hardness 

In this subsection we will prove that both 1DOSP and
2DOSP are NP-hard. To facilitate the proof, we first
define a Bounded Subset Sum (BSS) problem as follows.

Problem 2 (Bounded Subset Sum). Given a list of $n$
numbers $x_1, \ldots, x_n$ and a number $s$, where $\forall i \in [n]$, $2 \cdot x_i > x_{max}$
$(= \max_{i \in [n]} |x_i|)$, decide if there is a subset of the numbers
that sums up to $s$.

For example, given three numbers 1100, 1200, 1413 and
$s = 2300$, we can find a subset {1100, 1200} such that
1100 + 1200 = 2300. Additionally, we can assume
that $s > c \cdot x_{max}$, where $c$ is some constant; otherwise the
problem can be solved in polynomial time, since only subsets with a
constant number of elements need to be enumerated. Besides, without the bounded
constraint $\forall i \in [n]$, $2 \cdot x_i > x_{max}$, the BSS problem becomes
the Subset Sum problem, which is NP-complete. For
simplicity of later explanation, let $S$ denote the set of $n$
numbers. Note that we can assume that all the numbers
are integers.

Theorem 1. BSS problem is NP-complete.

The proof is in the Appendix. In the following, we will
show that even a simpler version of the 1DOSP problem is
NP-hard. In this simpler version, there is only one row
in the stencil, and a set of characters $C$ is given. Besides,
the blanks of each character are symmetric, and each
character $c_i \in C$ has the same length $w$.

Definition 1 (Minimum packing). Given a subset of characters
$C' \subseteq C$, its minimum packing is the packing with the
minimum stencil length.



Lemma 1. Given a set of characters $C = \{c_1, c_2, \ldots, c_n\}$
placed on a single row stencil. If for each character $c_i \in C$,
both of its left and right blanks are $s_i$, then the minimum
packing has the following stencil length:

$$n \cdot w - \sum_{i=1}^{n} s_i + \max_{i \in [n]} \{s_i\} \tag{2}$$


Proof: Without loss of generality, we assume that
$s_1 \ge s_2 \ge \cdots \ge s_n$. We prove by induction that in
a minimum length packing, the overlapping blank is
$f(n) = \sum_{i=2}^{n} s_i$. If there are only two characters, it is
trivial that $f(2) = s_2$. We assume that when $p = n - 1$,
the maximum overlapping blank is $f(n-1) = \sum_{i=2}^{n-1} s_i$.
For the last character $c_n$, the maximum sharing blank
value is $s_n$. Since $s_i \ge s_n$ for any $i < n$, we can
simply insert it at either the left end or the right end,
and obtain the incremental overlapping blank $s_n$. Thus
$f(n) = f(n-1) + s_n = \sum_{i=2}^{n} s_i$. Because the maximum
overlapping blank for all characters is $\sum_{i=2}^{n} s_i$, we can
see the minimum packing length is as in Eqn. (2). □
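The lemma can be sanity-checked by brute force on a small instance. The sketch below assumes, as the proof implies, that two adjacent characters share exactly the smaller of their facing blanks; it also allows per-character widths $w_i$, in which case the closed form reads $\sum_i (w_i - s_i) + \max_i s_i$ and reduces to Eqn. (2) when all widths equal $w$. Function names and data are hypothetical.

```python
from itertools import permutations


def packed_length(order):
    """Row length for a given left-to-right order of (width, blank) pairs.

    Assumption: adjacent characters overlap by the smaller of their blanks.
    """
    total_width = sum(w for w, _ in order)
    overlap = sum(min(order[k][1], order[k + 1][1]) for k in range(len(order) - 1))
    return total_width - overlap


def min_packing_length(chars):
    """Closed form: sum of (w_i - s_i) plus the largest blank."""
    return sum(w - s for w, s in chars) + max(s for _, s in chars)


# Hypothetical characters as (width, symmetric blank).
chars = [(10, 3), (12, 4), (8, 1), (9, 2)]
brute_force = min(packed_length(p) for p in permutations(chars))
print(brute_force, min_packing_length(chars))
```

Exhaustive search over all $n!$ orders is only feasible for tiny inputs; it is used here purely as a check of the closed form.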


Lemma 2. BSS $\le_p$ 1DOSP.

Proof: Given an instance of BSS with $s$ and $S =
\{x_1, x_2, \ldots, x_n\}$, we construct a 1DOSP instance as follows:

• The stencil length is set to $M + s$, where $M = \max_{i \in [n]} x_i$.

• For each $x_i \in S$, in 1DOSP there is a character $c_i$,
whose width is $M$ and both left and right blanks
are $M - x_i$. Since $x_i \ge M/2$, the sum of left blank
and right blank is less than or equal to $M$.

• We introduce an additional character $c_0$, whose
width is $M$, and both left and right blanks are
$M - \min_{i \in [n]} x_i$.

• The VSB writing time of character $c_0$ is set to
$\sum_{i \in [n]} x_i$, while the VSB writing time for each
character $c_i$ is set to $x_i$. The CP writing times are
set to 0 for all characters.

• There is only one region, and each character $c_i$
repeats one time in the region.

For instance, given initial set $S = \{1100, 1200, 2000\}$
and $s = 2300$, the constructed 1DOSP instance is shown
in Fig. 3(a).

We will show that the BSS instance $S = \{x_1, x_2, \ldots, x_n\}$ has
a subset that adds up to $s$ if and only if the constructed
1DOSP instance has minimum packing length $M + s$ and
total writing time smaller than $\sum_i x_i$.

($\Rightarrow$ part) After solving the BSS problem, a set of items
$S'$ is selected that adds up to $s$. For each $x_i \in S'$,
character $c_i$ is also selected into the stencil. Besides, since
the writing time for $c_0$ is $\sum_i x_i$, it is trivial to see
that in the 1DOSP instance $c_0$ must be selected. Due
to Lemma 1, the minimum total packing length is

$$(|S'| + 1) \cdot M - \sum_{i \in S'} (M - x_i) = M + \sum_{i \in S'} x_i = M + s$$

Meanwhile, the minimum total writing time in the
1DOSP instance is $\sum_i x_i - s$.

Fig. 3. (a) 1DOSP instance for the BSS instance $S =
\{1100, 1200, 2000\}$ and $s = 2300$. (b) The minimum packing
has stencil length $M + s = 2000 + 2300 = 4300$.

($\Leftarrow$ part) We start from a 1DOSP instance with minimum
packing length $M + s$ and total writing time smaller
than $\sum_i x_i$, where a set of characters $C' \subseteq C$ is selected.
Since the total writing time must be smaller than
$\sum_i x_i$, character $c_0 \in C'$. For each character $c_i \in C'$
except $c_0$, we select $x_i$ into the subset $S' \subseteq S$, which adds
up to $s$.

□
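The construction in the proof can be replayed on the example instance $S = \{1100, 1200, 2000\}$, $s = 2300$. The sketch below is illustrative only: the tuple layout and function names are hypothetical, and the blank of the extra character $c_0$ is taken as $M - \min_i x_i$, an assumption chosen so that $c_0$ carries the largest blank. The packing length uses the symmetric-blank closed form of Lemma 1.

```python
def build_1dosp_instance(xs, s):
    """Return (stencil_length, characters) for a BSS instance (xs, s).

    Each character is a tuple (name, width, blank, vsb_time); CP time is 0.
    """
    M = max(xs)
    if any(2 * x <= M for x in xs):
        raise ValueError("not a bounded instance: need 2 * x_i > x_max")
    c0 = ("c0", M, M - min(xs), sum(xs))  # assumed blank for c0, see lead-in
    chars = [(f"c{i}", M, M - x, x) for i, x in enumerate(xs, start=1)]
    return M + s, [c0] + chars


def min_packing_length(selected):
    """Lemma 1 closed form for symmetric blanks."""
    return sum(w - b for _, w, b, _ in selected) + max(b for _, _, b, _ in selected)


def writing_time(all_chars, selected):
    """Characters left off the stencil are written by VSB; CP shots cost 0 here."""
    on_stencil = {c[0] for c in selected}
    return sum(c[3] for c in all_chars if c[0] not in on_stencil)


stencil_length, chars = build_1dosp_instance([1100, 1200, 2000], 2300)
selected = chars[:3]  # c0 plus the characters for 1100 and 1200
print(stencil_length, min_packing_length(selected), writing_time(chars, selected))
```

The selected characters fill the stencil exactly ($M + s = 4300$), and the remaining writing time equals $\sum_i x_i - s = 2000$, matching both directions of the argument.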

Theorem 2. 1DOSP is NP-hard.

Proof: Directly from Lemma 2 and Theorem 1. □

Theorem 3. 2DOSP is NP-hard.

Since 1DOSP is a special case of 2DOSP, the
NP-hardness of 1DOSP implies that the 2DOSP problem is NP-hard
as well. Combining Theorem 2 and Theorem 3, we
conclude that the OSP problem, even for the
conventional EBL system, is NP-hard.


3 E-BLOW FOR 1DOSP

When each character implements one standard cell, the
enclosed circuit patterns of all the characters have the
same height. The corresponding OSP problem is called
1DOSP, which can be viewed as a combination of character
selection and single row ordering problems [24].
Different from the two-step heuristic proposed in previous work, we show
that these two problems can be solved simultaneously
through a unified ILP formulation (3). For convenience,
Table 1 lists the notations used in the 1DOSP problem.































TABLE 1

Notations used in 1D-ILP Formulation

$W$         width constraint of stencil or row
$n$         number of characters
$m$         number of rows
$x_i$       x-position of character $c_i$
$w_i$       width of character $c_i$
$o_{ij}$    horizontal overlap between $c_i$ and $c_j$
$p_{ij}$    0-1 variable, $p_{ij} = 0$ if $c_i$ is left of $c_j$
$a_{ij}$    0-1 variable, $a_{ij} = 1$ if $c_i$ is on the $j$-th row


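$$\begin{aligned}
\min \quad & T_{total} && && (3) \\
\text{s.t.} \quad & T_{total} \ge T_c^{VSB} - \sum_{i=1}^{n} \sum_{k=1}^{m} R_{ic} \cdot a_{ik} && \forall c \in P && (3a) \\
& 0 \le x_i \le W - w_i && \forall i && (3b) \\
& \sum_{k=1}^{m} a_{ik} \le 1 && \forall i && (3c) \\
& x_i + w_{ij} - x_j \le W \, (2 + p_{ij} - a_{ik} - a_{jk}) && \forall i, j, k && (3d) \\
& x_j + w_{ji} - x_i \le W \, (3 - p_{ij} - a_{ik} - a_{jk}) && \forall i, j, k && (3e) \\
& a_{ik}, a_{jk}, p_{ij} : \text{0-1 variable} && \forall i, j, k && (3f)
\end{aligned}$$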


In formulation (3), $W$ is the stencil width and $m$ is the
number of rows. For each character $c_i$, $w_i$ and $x_i$ are the
width and the x-position, respectively. If and only if $c_i$ is
assigned to the $k$-th row, $a_{ik} = 1$; otherwise, $a_{ik} = 0$.
Constraints (3d) and (3e) are used to check the position relationship
between $c_i$ and $c_j$. Here $w_{ij} = w_i - o_{ij}$ and $w_{ji} = w_j - o_{ji}$,
where $o_{ij}$ is the overlap when candidates $c_i$ and $c_j$
are packed together. Only when $a_{ik} = a_{jk} = 1$, i.e., both
character $i$ and character $j$ are assigned to row $k$, one
of the two constraints (3d) and (3e) will be active. Besides,
for any three characters $c_1, c_2, c_3$ being assigned to row
$k$, i.e., $a_{1k} = a_{2k} = a_{3k} = 1$, the variables $p_{12}$, $p_{13}$ and $p_{23}$ are
self-consistent. That is, if $c_1$ is on the left of $c_2$ ($p_{12} = 0$) and
$c_2$ is on the left of $c_3$ ($p_{23} = 0$), then $c_1$ should be on the
left of $c_3$ ($p_{13} = 0$). Similarly, if $c_1$ is on the right of $c_2$
($p_{12} = 1$) and $c_2$ is on the right of $c_3$ ($p_{23} = 1$), then $c_1$
should be on the right of $c_3$ ($p_{13} = 1$) as well.

Since ILP is a well known NP-hard problem, directly
solving it may suffer from a long runtime penalty. One
straightforward speedup method is to relax the ILP
into the corresponding linear program (LP) by
replacing constraints (3f) with the following:

$$0 \le a_{ik}, a_{jk}, p_{ij} \le 1$$

It is obvious that the LP solution provides a lower
bound to the ILP solution. However, we observe that
the solution of the relaxed LP could be like this: for each $i$,
$\sum_j a_{ij} = 1$ and all the $p_{ij}$ are assigned 0.5. Although the
objective function is minimized and all the constraints
are satisfied, this LP relaxation provides no useful
information to guide future rounding, i.e., all the character
candidates are selected and no ordering relationship is
determined.
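Why the relaxation is uninformative can be seen on a single pair of characters. The sketch below assumes the two ordering constraints take the big-$W$ form $x_i + w_{ij} - x_j \le W(2 + p_{ij} - a_{ik} - a_{jk})$ and $x_j + w_{ji} - x_i \le W(3 - p_{ij} - a_{ik} - a_{jk})$; the function name and numbers are hypothetical.

```python
def ordering_constraints_hold(x_i, x_j, w_ij, w_ji, p_ij, a_ik, a_jk, W):
    """Check the pair of ordering constraints for characters i and j on row k."""
    left_of = x_i + w_ij - x_j <= W * (2 + p_ij - a_ik - a_jk)    # active when p_ij = 0
    right_of = x_j + w_ji - x_i <= W * (3 - p_ij - a_ik - a_jk)   # active when p_ij = 1
    return left_of and right_of


W = 100
# Two characters of effective width 30 placed at the same x-position on one row.
print(ordering_constraints_hold(0, 0, 30, 30, 0, 1, 1, W))    # False: i must be left of j
print(ordering_constraints_hold(0, 0, 30, 30, 1, 1, 1, W))    # False: i must be right of j
print(ordering_constraints_hold(0, 0, 30, 30, 0.5, 1, 1, W))  # True: fractional p relaxes both
```

With $p_{ij} = 0.5$ both right-hand sides become $0.5W$, so even a physically overlapping placement is feasible, and the fractional solution carries no ordering information.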

To overcome the limitation of the above rounding, E-BLOW
adopts a novel successive rounding framework
to search for a near-optimal solution in reasonable runtime.
As shown in Fig. 4, the overall flow includes several
steps: Simplified ILP Formulation, Successive Rounding,
Fast ILP Convergence, Refinement, Post-Swap and Post-Insertion.
In Section 3.1 the simplified formulation is
discussed, and its LP rounding lower bound is
proved. In Section 3.2 the details of successive rounding
are introduced. In Section 3.3 the Fast ILP Convergence
technique is presented. In Section 3.4
the refinement process is proposed. At last, to further
improve the performance, in Section 3.5 the post-swap
and post-insertion techniques are discussed.


3.1 Simplified ILP Formulation 

As discussed above, solving the ILP formulation (3) is
very time consuming, and the related LP relaxation may
perform poorly. To overcome the limitations of (3),
in this section we introduce a simplified ILP formulation,
whose LP relaxation can provide a good lower bound.
The simplified formulation is based on a symmetrical
blank (S-Blank) assumption: the blanks of each character
are symmetric, i.e., the left blank equals the right blank. $s_i$ is
used to denote the blank of character $c_i$. Note that for
different characters $c_i$ and $c_j$, their blanks $s_i$ and $s_j$ can
be different.

At first glance the S-Blank assumption may lose optimality.
However, it provides several practical and theoretical
benefits. (1) In previous work, the single row ordering
problem was transformed into a Hamilton Cycle problem,
which is a well known NP-hard problem for which even
a dedicated solver is quite expensive. In our work, instead
of relying on an expensive solver, under this assumption
the problem can be optimally solved in $O(n)$. (2) Under
the S-Blank assumption, the ILP formulation can be
effectively simplified to provide a reasonable rounding bound
theoretically. Compared with the previous heuristic
framework, the proved rounding bound provides a better
guideline for a global view search. (3) To compensate for
the inaccuracy in the asymmetrical blank cases, E-BLOW
provides a refinement (see Section 3.4).

The simplified ILP formulation is shown in Eqn. (4).

$$\begin{aligned}
\max \quad & \sum_i \sum_j a_{ij} \cdot profit_i && && (4) \\
\text{s.t.} \quad & \sum_i (w_i - s_i) \cdot a_{ij} \le W - B_j && \forall j && (4a) \\
& B_j \ge s_i \cdot a_{ij} && \forall i, j && (4b) \\
& \sum_j a_{ij} \le 1 && \forall i && (4c) \\
& a_{ij} = 0 \text{ or } 1 && \forall i, j && (4d)
\end{aligned}$$

In the objective function of Eqn. (4), each character
$c_i$ is associated with one profit value $profit_i$. The
$profit_i$ value evaluates the overall system writing
time improvement if character $c_i$ is selected. Through
assigning each character $c_i$ one profit value, we
can simplify the complex constraint (3a). More details
regarding the profit value setting are discussed in
Section 3.2. Besides, due to Lemma 1, constraint (4a) and
constraint (4b) are for row width calculation, where (4b)
is to linearize the max operation. Here $B_j$ can be viewed as
the maximum blank space of all the characters on row
$r_j$. Constraint (4c) implies each character can be assigned
to at most one row. It is easy to see that the number of
variables is $O(nm)$, where $n$ is the number of characters
and $m$ is the number of rows. Generally speaking, the
character number $n$ is much larger than the row number
$m$. Thus compared with the basic ILP formulation (3), the
variable number in (4) can be reduced dramatically.

In our implementation, we set $s_i$ to $\lceil (sl_i + sr_i)/2 \rceil$,
where $sl_i$ and $sr_i$ are $c_i$'s left blank and right blank,
respectively. Note that here the ceiling function is used to
make sure that under the S-Blank assumption, each blank
is still integral. Although this setting may lose some
optimality, E-BLOW provides a post-stage to compensate for
the inaccuracy through incremental character insertion.
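A short sketch of the S-Blank setting and the per-row width check may help. It assumes the row constraints have the form $\sum_i (w_i - s_i) a_{ij} \le W - B_j$ and $B_j \ge s_i a_{ij}$; the function names and the sample row are hypothetical.

```python
def symmetric_blank(sl: int, sr: int) -> int:
    """Ceiling of (sl + sr) / 2 using integer arithmetic (non-negative blanks)."""
    return (sl + sr + 1) // 2


def satisfies_row_constraints(chars, selected, W, B_j):
    """Check the width and max-blank constraints for one row j.

    chars: list of (w_i, s_i); selected: 0/1 values a_ij for this row.
    """
    used = sum((w - s) * a for (w, s), a in zip(chars, selected))
    width_ok = used <= W - B_j
    blank_ok = all(B_j >= s * a for (_, s), a in zip(chars, selected))
    return width_ok and blank_ok


# Hypothetical row: three characters given as (width, symmetric blank).
chars = [(40, 4), (35, 2), (30, 5)]
print(symmetric_blank(3, 4))                                  # 4
print(satisfies_row_constraints(chars, [1, 1, 1], 100, 5))    # True
print(satisfies_row_constraints(chars, [1, 1, 1], 100, 4))    # False: B_j below largest blank
```

Because $B_j$ only has to dominate the blanks of the characters actually placed on the row, the max operation is expressed with linear inequalities.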

Now we will show that the LP relaxation of (4) has a
reasonable lower bound. To explain this, let us first look
at a similar formulation (5) as follows:

$$\begin{aligned}
\max \quad & \sum_i \sum_j a_{ij} \cdot profit_i && && (5) \\
\text{s.t.} \quad & \sum_i (w_i - s_i) \cdot a_{ij} \le W - max_s && \forall j && (5a) \\
& (4c) - (4d)
\end{aligned}$$

where $max_s$ is the maximum horizontal blank length
of every character, i.e., $max_s = \max\{s_i \mid i = 1, 2, \ldots, n\}$.
Program (5) is a multiple knapsack problem. A
multiple knapsack problem is similar to a knapsack problem,
with the difference that there are multiple knapsacks.

Algorithm 1 SuccRounding ( $th_{inv}$ )

Require: ILP Formulation (4)

1: Set all $a_{ij}$ as unsolved;
2: repeat
3:   Update $profit_i$ for all unsolved $a_{ij}$;
4:   Solve relaxed LP of (4);
5:   repeat
6:     $a_{pq} \leftarrow \max\{a_{ij}\}$;
7:     for all $a_{ij} \ge a_{pq} \times th_{inv}$ do
8:       if $c_i$ can be assigned to row $r_j$ then
9:         $a_{ij} = 1$ and set it as solved;
10:        Update capacity of row $r_j$;
11:      end if
12:    end for
13:  until cannot find $a_{pq}$
14: until
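One thresholding step of Algorithm 1 (lines 6-12) can be sketched as follows. This is a simplified illustration under stated assumptions, not the E-BLOW implementation: the fractional LP values are supplied as a dictionary rather than produced by an LP solver, a character "can be assigned" when the S-Blank row width stays within $W$, and all names are hypothetical.

```python
def row_width(row):
    """S-Blank row width: sum of (w - s) plus the largest blank; 0 for an empty row."""
    if not row:
        return 0
    return sum(w - s for w, s in row) + max(s for _, s in row)


def threshold_round(frac, chars, rows, W, th_inv=0.9):
    """Assign characters whose LP value is within th_inv of the maximum value.

    frac: {(i, j): LP value of a_ij}; chars: list of (w_i, s_i);
    rows: list of lists of (w, s) already packed on each row (updated in place).
    Returns {i: j} for the characters assigned in this step.
    """
    a_pq = max(frac.values())
    assigned = {}
    for (i, j), value in sorted(frac.items(), key=lambda kv: -kv[1]):
        if value < a_pq * th_inv:
            break  # remaining values are all below the threshold
        if i in assigned:
            continue  # a character goes to at most one row
        if row_width(rows[j] + [chars[i]]) <= W:
            rows[j].append(chars[i])
            assigned[i] = j
    return assigned


# Hypothetical LP solution for 4 characters and 2 rows of width 100.
chars = [(40, 4), (35, 2), (30, 5), (50, 3)]
frac = {(0, 0): 1.0, (1, 0): 0.95, (2, 0): 0.92, (2, 1): 0.08, (3, 0): 0.5, (3, 1): 0.5}
rows = [[], []]
print(threshold_round(frac, chars, rows, W=100))
```

Character 3 stays unsolved in this step because its values are far from the maximum; in the full flow it would be reconsidered after the LP is re-solved with updated profits.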


In formulation (5), each $profit_i$ can be rephrased as
$(w_i - s_i) \times ratio_i$.

Lemma 3. If each $ratio_i$ is the same, the multiple knapsack
problem (5) has a 0.5-approximation algorithm using the LP
rounding method.

For brevity we omit the proof; detailed explanations
can be found in the literature. When all $ratio_i$ are the same,
formulation (5) can be approximated by a max-flow problem.
In addition, if we denote $\alpha$ as $\min\{ratio_i\}/\max\{ratio_i\}$,
we can obtain the following lemma:

Lemma 4. The LP rounding solution of (5) can be a $0.5\alpha$-approximation
to the optimal solution of (5).

Proof: First we introduce a modified formulation of
program (5), where each $profit_i$ is set to $\min\{profit_i\}$.
In other words, in the modified formulation, each $ratio_i$
is the same. Let $OPT$ and $OPT'$ be the optimal values of
(5) and the modified formulation, respectively. Let $APR'$
be the corresponding LP rounding result in the modified
formulation. According to Lemma 3, $APR' \ge 0.5 \cdot OPT'$.
Since $\min\{profit_i\} \ge profit_i \cdot \alpha$, we get $OPT' \ge
\alpha \cdot OPT$. In summary, $APR' \ge 0.5 \cdot OPT' \ge 0.5\alpha \cdot OPT$.

□

The difference between (4) and (5) is the right-hand side
values at (4a) and (5a). Since the blank spacing is relatively small
compared with the row length, we get $W - max_s \approx W - B_j$.
Then we can expect that program (4)
has a reasonable rounding performance.

3.2 Successive Rounding 

In this subsection we propose a successive rounding
algorithm to solve program (4) iteratively. Successive
rounding uses a simple iterative scheme in which fractional
variables are rounded one after the other until an
integral solution is found. The ILP formulation (4)
becomes an LP if we relax the discrete constraint to a
continuous constraint: $0 \le a_{ij} \le 1$.









The details of successive rounding are shown in Algorithm 1.
At first we set all $a_{ij}$ as unsolved since none
of them is assigned to rows. The LP is updated and
solved iteratively. For each new LP solution, we search for
the maximal $a_{ij}$ and store it in $a_{pq}$ (line 6). Then we find
all $a_{ij}$ that are closest to the maximum value $a_{pq}$, i.e.,
$a_{ij} \ge a_{pq} \times th_{inv}$. In our implementation, $th_{inv}$ is set
to 0.9. For each selected variable $a_{ij}$, we try to pack
$c_i$ into row $r_j$, and set $a_{ij}$ as solved. Note that when one
character $c_i$ is assigned to one row, all $a_{ij}$ of that character are set
as solved. Therefore, the variable number in the updated LP
formulation continues to decrease. This procedure
repeats until no appropriate $a_{ij}$ can be found. One key
step of Algorithm 1 is the $profit_i$ update (line 3). For
each character $c_i$, we set its $profit_i$ as follows:

$$profit_i = \sum_{c} \frac{t_c}{t_{max}} \cdot (n_i - 1) \cdot t_{ic} \tag{6}$$

where $t_c$ is the current writing time of region $r_c$, and $t_{max} =
\max\{t_c, \forall c \in P\}$. Through applying the $profit_i$, the
region $r_c$ with longer writing time is considered more
during the LP formulation. During successive
rounding, if $c_i$ is not assigned to any row, $profit_i$
continues to be updated, so that the total writing time of
the whole MCC system can be minimized.
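The profit update can be written in a few lines. The sketch assumes the profit has the form $profit_i = \sum_c (t_c / t_{max}) \cdot (n_i - 1) \cdot t_{ic}$, i.e., the writing-time reduction of $c_i$ in each region weighted by how close that region is to being the bottleneck; the function name and data are hypothetical.

```python
def update_profit(t_cur, t, n):
    """Return profit_i for every candidate.

    t_cur[c]: current writing time of region c; t[c][i]: repeats of c_i in region c;
    n[i]: VSB shot count of c_i.
    """
    t_max = max(t_cur)
    return [
        sum(t_cur[c] / t_max * (n[i] - 1) * t[c][i] for c in range(len(t_cur)))
        for i in range(len(n))
    ]


# Hypothetical data: region 1 is currently the bottleneck.
t_cur = [50, 100]
t = [[10, 0], [1, 20]]
n = [5, 3]
print(update_profit(t_cur, t, n))  # [24.0, 40.0]
```

Candidate 0 saves 40 shots in the non-critical region but only 4 in the critical one, so after weighting it is ranked below candidate 1.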

3.3 Fast ILP Convergence 



Fig. 5. Unsolved character number along the LP iterations
for test cases 1M-1, 1M-2, 1M-3, and 1M-4.

During successive rounding, in each LP iteration we
select some characters into rows and set these characters
as solved. In the next LP iteration, only unsolved
characters are considered in the formulation. Thus
the number of unsolved characters continues to decrease
through the iterations. For four test cases (1M-1 to 1M-4),
Fig. 5 illustrates the number of unsolved characters in
each iteration. We observe that in early iterations, more
characters are assigned to rows. However, when
the stencil is almost full, fewer $a_{ij}$ are close
to 1. Thus, in late iterations only a few characters
are assigned to the stencil, and the successive rounding
requires more iterations.

To overcome this limitation so that the number of successive
rounding iterations can be reduced, we present
a convergence technique based on a fast ILP formulation.
The basic idea is that when we observe that only a few
characters are assigned to rows in one LP iteration, we
stop successive rounding early and call fast ILP
convergence to assign all remaining characters. Note that in [25]
an ILP formulation with a similar idea was also applied.
The details of the ILP convergence are shown in Algorithm 2.
The inputs are the solutions of the last LP rounding and
two parameters $L_{th}$ and $U_{th}$. First we check each $a_{ij}$
(lines 1-9). If $a_{ij} < L_{th}$, then we assume character $c_i$
will not be assigned to row $r_j$, and set $a_{ij}$ as solved.
Similarly, if $a_{ij} > U_{th}$, we assign $c_i$ to row $r_j$ and set
$a_{ij}$ as solved. For those unsolved $a_{ij}$ we build an ILP
formulation to assign final rows (lines 10-13).

Algorithm 2 Fast ILP Convergence ( $L_{th}$, $U_{th}$ )

Require: Solutions of relaxed LP (4);

1: for all $a_{ij}$ in relaxed LP solutions do
2:   if $a_{ij} < L_{th}$ then
3:     Set $a_{ij}$ as solved;
4:   end if
5:   if $a_{ij} > U_{th}$ then
6:     Assign $c_i$ to row $r_j$;
7:     Set $a_{ij}$ as solved;
8:   end if
9: end for
10: Solve ILP formulation (4) for all unsolved $a_{ij}$;
11: if $a_{ij} = 1$ then
12:   Assign $c_i$ to row $r_j$;
13: end if

Fig. 6. For test case 1M-1, solution distribution in last LP,
where most of values are close to 0.

At first glance the ILP formulation may be expensive
to solve. However, we observe that in our convergence
Algorithm 2, the variable number is typically small.
Fig. 6 illustrates the solution distribution in the last LP
formulation. We can see that most of the values are
close to 0. In our implementation $L_{th}$ and $U_{th}$ are set
to 0.1 and 0.9, respectively. For this case, although the
LP formulation contains more than 2500 variables, our
fast ILP formulation results in only 101 binary variables.
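The variable-fixing step of the fast convergence (the first loop of Algorithm 2) amounts to a three-way split of the last LP solution. The sketch below is illustrative only; the function name and the sample values are assumptions, and the remaining small ILP is left to whatever solver is available.

```python
def split_by_threshold(frac, l_th=0.1, u_th=0.9):
    """Partition LP values into fixed-to-0, fixed-to-1, and still-unsolved variables.

    frac: {(i, j): LP value of a_ij}. Only the unsolved keys go to the final ILP.
    """
    fixed_zero, fixed_one, unsolved = [], [], []
    for key, value in frac.items():
        if value < l_th:
            fixed_zero.append(key)
        elif value > u_th:
            fixed_one.append(key)
        else:
            unsolved.append(key)
    return fixed_zero, fixed_one, unsolved


# Hypothetical last LP solution with mostly near-zero values.
frac = {(0, 0): 0.02, (0, 1): 0.97, (1, 0): 0.45, (1, 1): 0.55, (2, 0): 0.0, (2, 1): 0.05}
zeros, ones, rest = split_by_threshold(frac)
print(len(zeros), len(ones), len(rest))  # 3 1 2
```

Since most values in the last LP are close to 0, the list handed to the final ILP is typically a small fraction of all variables.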
as a set $(w, l, r, O)$, where $w$ is the total length of the order,
$l$ is the left blank of the left character, $r$ is the right blank
of the right character, and $O$ is the character order. At the
beginning, an empty solution set $S$ is initialized (line 1).
If $k = 1$, then an initial solution $(w_1, sl_1, sr_1, \{c_1\})$ would
be generated (line 2). Here $w_1$, $sl_1$ and $sr_1$ are the width
of the first character $c_1$, the left blank of $c_1$, and the right blank of
$c_1$. If $k > 1$, then Refine($k$) will recursively call Refine($k-1$)
to generate all old partial solutions. All these partial
solutions will be updated by adding candidate $c_k$ (lines
5-9).


Fig. 7. Greedy based Single Row Ordering. (a) At first all
candidates are sorted by blank space. (c) One possible
ordering solution where each candidate chooses the right
end position. (e) Another possible ordering solution.

3.4 Refinement 

Refinement is a stage that solves the single row ordering problem, which adjusts the relative locations of the input $p$ characters to minimize the total width. Under the S-Blank assumption, the lemma established earlier implies that this problem can be solved optimally through the following two-step greedy approach.

1) All characters are sorted decreasingly by blanks; 

2) All characters are inserted one by one. Each one 
can be inserted at either left end or right end. 

One example of the greedy approach is illustrated in Fig. 7, where four character candidates A, B, C and D are to be ordered. In Fig. 7(a), they are sorted in decreasing order of blank space. Then all the candidates are inserted one by one. From the second candidate on, each insertion has two options: the left side or the right side of the candidates packed so far. For example, if A is inserted at the right of D, B has two insertion options: one is at the right side of A (Fig. 7(b)), another is at the left side of A (Fig. 7(d)). Given the different choices for candidate B, Fig. 7(c) and Fig. 7(e) give the corresponding final solutions. Since from the second candidate on each one has two choices, by this greedy approach $n$ candidates generate $2^{n-1}$ possible solutions.
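The two-step greedy approach can be sketched as below. This is an illustrative sketch under stated assumptions: each character is modelled as a hypothetical pair `(width, blank)` with the same blank on both sides, two neighbouring characters are assumed to share the smaller of their two blanks, and the input list is assumed to be non-empty. The brute-force helper is included only to check the optimality claim on small inputs.

```python
from itertools import permutations, product


def row_width(order):
    """Width of a row of (width, blank) characters; neighbours share min(blank)."""
    total = sum(w for w, _ in order)
    overlap = sum(min(a[1], b[1]) for a, b in zip(order, order[1:]))
    return total - overlap


def greedy_orders(chars):
    """Yield the 2^(n-1) orders built by left-end or right-end insertion."""
    ranked = sorted(chars, key=lambda c: c[1], reverse=True)  # step 1: sort by blank
    for choice in product("LR", repeat=len(ranked) - 1):      # step 2: pick an end
        order = [ranked[0]]
        for side, c in zip(choice, ranked[1:]):
            order = [c] + order if side == "L" else order + [c]
        yield order


def best_width_bruteforce(chars):
    """Minimum row width over all n! orders (only feasible for tiny n)."""
    return min(row_width(list(p)) for p in permutations(chars))


chars = [(40, 5), (40, 8), (40, 3), (40, 6)]
orders = list(greedy_orders(chars))
widths = {row_width(o) for o in orders}
```

For this symmetric-blank instance every greedy order has the same width, which matches the exhaustive minimum; with asymmetric blanks the greedy orders can differ in width, which is what the refinement stage addresses.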

For the asymmetrical cases, the optimality no longer holds. To compensate for the loss, E-BLOW includes a refinement stage. For $n$ characters $\{c_1, \dots, c_n\}$, single row ordering can have $n!$ possible solutions. We avoid enumerating such a huge number of solutions and instead take advantage of the order obtained under the symmetrical blank assumption. That is, we pick the best solution from the $2^{n-1}$ possible ones. Note that although considering $2^{n-1}$ instead of $n!$ options cannot guarantee an optimal single row packing, our preliminary results show that the solution quality loss is negligible in practice.

The refinement is based on dynamic programming, and the details are shown in Algorithm 3. Refine($k$) generates all possible order solutions for the first $k$ characters $\{c_1, \dots, c_k\}$. Each order solution is represented as a tuple $(w, l, r, O)$ of total width, left blank, right blank, and character order.


Algorithm 3 Refine(k)

Require: k characters $\{c_1, \dots, c_k\}$;
 1: if k = 1 then
 2:     Add $(w_1, sl_1, sr_1, \{c_1\})$ into S;
 3: else
 4:     Refine(k-1);
 5:     for each partial solution $(w, l, r, O)$ do
 6:         Remove $(w, l, r, O)$ from S;
 7:         Add $(w + w_k - \min(sr_k, l), sl_k, r, \{c_k, O\})$ into S;
 8:         Add $(w + w_k - \min(sl_k, r), l, sr_k, \{O, c_k\})$ into S;
 9:     end for
10:     if size of S > threshold then
11:         Prune inferior solutions in S;
12:     end if
13: end if


We propose pruning techniques to speed up the dynamic programming process. Let us introduce the concept of inferior solutions. For any two solutions $S_a = (w_a, l_a, r_a, O_a)$ and $S_b = (w_b, l_b, r_b, O_b)$, we say $S_b$ is inferior to $S_a$ if and only if $w_a \le w_b$, $l_a \ge l_b$ and $r_a \ge r_b$, i.e., $S_a$ is no wider than $S_b$ and offers at least as much blank space at both ends to share with later insertions. Inferior solutions are pruned in the pruning step (lines 10-12). In our implementation, the threshold is set to 20.
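The dynamic programming of Algorithm 3 together with the pruning step can be sketched as follows. This is a hedged, iterative re-expression of the recursive listing, not the authors' code: the names `refine` and `prune`, the tuple layout `(width, left_blank, right_blank)`, and the use of index tuples for the order are all assumptions. Pruning assumes that a smaller width and larger end blanks are preferable, since a larger end blank can only increase the overlap with a character inserted later.

```python
def prune(solutions):
    """Drop solutions dominated by another one (no wider, blanks at least as large)."""
    unique = {}
    for s in solutions:
        unique.setdefault(s[:3], s)          # keep one order per (w, l, r)
    cands = list(unique.values())
    return [s for s in cands
            if not any(t is not s and t[0] <= s[0] and t[1] >= s[1] and t[2] >= s[2]
                       for t in cands)]


def refine(chars, threshold=20):
    """chars: list of (width, left_blank, right_blank) in greedy insertion order.

    Returns candidate solutions (w, l, r, order); order is a tuple of indices.
    """
    w1, sl1, sr1 = chars[0]
    solutions = [(w1, sl1, sr1, (0,))]
    for k in range(1, len(chars)):
        wk, slk, srk = chars[k]
        extended = []
        for w, l, r, order in solutions:
            # c_k at the left end: its right blank overlaps the row's left blank
            extended.append((w + wk - min(srk, l), slk, r, (k,) + order))
            # c_k at the right end: its left blank overlaps the row's right blank
            extended.append((w + wk - min(slk, r), l, srk, order + (k,)))
        solutions = prune(extended) if len(extended) > threshold else extended
    return solutions


chars = [(40, 5, 3), (40, 4, 6), (40, 2, 2), (40, 1, 5)]
all_solutions = refine(chars, threshold=10**9)   # never prune
pruned_solutions = refine(chars, threshold=0)    # prune at every step
```

Comparing the best width of `all_solutions` and `pruned_solutions` is a quick way to check that dominance pruning does not discard the best end-insertion order.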

3.5 Post-Swap and Post-Insertion 

After refinement, a post-swap stage is applied to further improve the performance. In each swap operation, an unselected character is swapped with a character on the stencil if the swap improves the writing time. The post-swap is implemented greedily in two steps. First, all the unselected characters are sorted. Second, the unselected characters try to swap with the characters on the stencil one by one.

After post-swap, a post-insertion stage is applied to insert more characters into the stencil. Unlike the greedy insertion approach in [24], where new characters can only be inserted at the right end of a row, we also consider inserting characters into the middle of rows. Generally speaking, a character with a higher profit value has a higher priority to be inserted into the rows. We propose a character insertion algorithm to insert some additional characters into the rows.

























Fig. 8. Example of maximum weighted matching based post character insertion. (a) Three additional characters a, b, c and two rows. (b) Corresponding bipartite graph to represent the relationships among characters and rows.


The insertion is formulated as a maximum weighted matching problem [34], under the constraint that at most one character can be inserted into each row. Although this assumption may lose some optimality, in practice it works quite well, as the remaining space of a row is usually very limited.

Fig. 8 illustrates one example of the character insertion. As shown in Fig. 8(a), there are two rows (row 1, row 2) and three additional characters (a, b, c). Characters a and b can be inserted into either row 1 or row 2, but character c can only be inserted into row 2. Note that the insertion positions are labeled by arrows. For example, the two arrows from character a mean that a can be inserted into the middle of each row. We build a bipartite graph to represent the relationships among characters and rows (see Fig. 8(b)). Each edge is weighted by the character's profit. Using the bipartite graph, the best character insertion can be found by computing a maximum weighted matching.
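The matching formulation can be illustrated on the three-character, two-row example. The sketch below is an assumption-laden toy: it uses exhaustive search in place of a polynomial-time maximum weighted matching algorithm, the function name `best_insertion` is hypothetical, and the profit values are invented purely for illustration.

```python
def best_insertion(profit, fits):
    """Maximum-profit assignment with at most one character per row.

    profit: dict character -> profit value (edge weight).
    fits:   dict character -> set of rows the character can be inserted into.
    Returns (total_profit, {character: row}).
    """
    chars = list(profit)

    def search(idx, used_rows):
        if idx == len(chars):
            return 0, {}
        # Option 1: leave this character off the stencil.
        best_value, best_match = search(idx + 1, used_rows)
        c = chars[idx]
        # Option 2: insert it into any still-free row it fits into.
        for row in fits.get(c, ()):
            if row in used_rows:
                continue
            value, match = search(idx + 1, used_rows | {row})
            value += profit[c]
            if value > best_value:
                best_value, best_match = value, {**match, c: row}
        return best_value, best_match

    return search(0, frozenset())


# a and b fit into either row, c only into row 2 (profits are made up).
profit = {"a": 5, "b": 2, "c": 4}
fits = {"a": {1, 2}, "b": {1, 2}, "c": {2}}
total, assignment = best_insertion(profit, fits)
```

With these illustrative profits the search inserts a into row 1 and c into row 2; a production implementation would replace the exhaustive search with a standard matching algorithm.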

Given $n$ additional characters, we search the possible insertion positions in each row. The time complexity of searching all the possibilities is $O(nmC)$, where $m$ is the total number of rows and $C$ is the maximum number of characters on each row. We propose two heuristics to speed up the search process. First, to reduce $n$, we only consider those additional characters with high profits. Second, to reduce $m$, we skip those rows with very little empty space.


4 E-BLOW for 2DOSP

Now we consider a more general case: the blank spaces of characters are non-uniform along both horizontal and vertical directions. This problem is referred to as the 2DOSP problem. In [24] the 2DOSP problem was transformed into a floorplanning problem. However, several key differences between traditional floorplanning and OSP were ignored. (1) In OSP there is no wirelength to be considered, while in floorplanning wirelength is a major optimization objective. (2) Compared with complex IP cores, many characters may have similar sizes. (3) Traditional floorplanners cannot handle the problem size of modern MCC designs.


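TABLE 2

Notations used in 2D-ILP Formulation

| Notation | Meaning |
|---|---|
| $W$ ($H$) | width (height) constraint of stencil |
| $w_i$ ($h_i$) | width (height) of candidate $c_i$ |
| $o^h_{ij}$ ($o^v_{ij}$) | horizontal (vertical) overlap between $c_i$ and $c_j$ |
| $w_{ij}$ ($h_{ij}$) | $w_{ij} = w_i - o^h_{ij}$, $h_{ij} = h_i - o^v_{ij}$ |
| $a_i$ | 0-1 variable, $a_i = 1$ if $c_i$ is on stencil |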


4.1 ILP Formulation

Here we show that 2DOSP can be formulated as an integer linear program (ILP) as well. Compared with 1DOSP, 2DOSP is more general: the blank spaces of characters are non-uniform along both horizontal and vertical directions. The 2DOSP problem is formulated as ILP formulation (7). For convenience, Table 2 lists some notations used in the ILP formulation. The formulation is motivated by prior work, but the difference is that our formulation can optimize both placement constraints and character selection simultaneously.


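$$
\begin{aligned}
\min\ & T_{total} & (7)\\
\text{s.t.}\ & T_{total} \ge T^{VSB}_c - \sum_{i=1}^{n} R_{ic} \cdot a_i, \quad \forall c \in P & (7a)\\
& x_i + w_{ij} \le x_j + W(2 + p_{ij} + q_{ij} - a_i - a_j), \quad \forall i, j & (7b)\\
& x_i - w_{ji} \ge x_j - W(3 + p_{ij} - q_{ij} - a_i - a_j), \quad \forall i, j & (7c)\\
& y_i + h_{ij} \le y_j + H(3 - p_{ij} + q_{ij} - a_i - a_j), \quad \forall i, j & (7d)\\
& y_i - h_{ji} \ge y_j - H(4 - p_{ij} - q_{ij} - a_i - a_j), \quad \forall i, j & (7e)\\
& 0 \le x_i + w_i \le W, \quad 0 \le y_i + h_i \le H, \quad \forall i & (7f)\\
& p_{ij}, q_{ij}, a_i : \text{0-1 variable}, \quad \forall i, j & (7g)
\end{aligned}
$$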


where $a_i$ indicates whether candidate $c_i$ is on the stencil, and $p_{ij}$ and $q_{ij}$ represent the location relationship between $c_i$ and $c_j$. The number of variables is $O(n^2)$, where $n$ is the number of characters. We can see that if $a_i = 0$, constraints (7b)-(7e) are not active. Besides, it is easy to see that when $a_i = a_j = 1$, for each of the four possible choices $(p_{ij}, q_{ij}) = (0,0), (0,1), (1,0), (1,1)$, only one of the four inequalities (7b)-(7e) is active. For example, with $(a_i, a_j, p_{ij}, q_{ij}) = (1,1,1,1)$, only constraint (7e) applies, which allows character $c_i$ to be anywhere above character $c_j$. The other three constraints (7b)-(7d) are always satisfied for any permitted values of $(x_i, y_i)$ and $(x_j, y_j)$.

Program (7) can be relaxed to a linear program (LP) by replacing constraint (7g) with

$$0 \le p_{ij}, q_{ij}, a_i \le 1.$$

However, similar to the discussion for 1DOSP, the relaxed LP solution provides no guidance for the packing: every $a_i$ is set to 1, and every $p_{ij}$ is set to 0.5. In other words, all the character candidates are selected and no ordering relationship is determined. Therefore the LP rounding method cannot be effectively applied to program (7).
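To see why such a fractional solution carries no ordering information, one can substitute $a_i = a_j = 1$ and $p_{ij} = q_{ij} = 0.5$ into the four non-overlap constraints of the ILP. The following worked check is a sketch that additionally assumes non-negative coordinates, $x_i, x_j, y_i, y_j \ge 0$, and $w_{ij} \le w_i$, $h_{ij} \le h_i$:

```latex
\begin{aligned}
x_i + w_{ij} &\le x_j + W(2 + 0.5 + 0.5 - 1 - 1) = x_j + W,\\
x_i - w_{ji} &\ge x_j - W(3 + 0.5 - 0.5 - 1 - 1) = x_j - W,\\
y_i + h_{ij} &\le y_j + H(3 - 0.5 + 0.5 - 1 - 1) = y_j + H,\\
y_i - h_{ji} &\ge y_j - H(4 - 0.5 - 0.5 - 1 - 1) = y_j - H.
\end{aligned}
```

Each inequality is relaxed by a full stencil width $W$ or height $H$. Since every character must lie inside the stencil ($x_i + w_i \le W$, $y_i + h_i \le H$), all four hold for any placement, so the relaxed solution imposes no relative position between $c_i$ and $c_j$.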



































Fig. 9. E-BLOW overall flow for 2DOSP. 


Algorithm 4 KD-Tree based Clustering

Require: set of character candidates.
 1: Sort all candidates by $profit_i$;
 2: Set each candidate $c_i$ to unclustered;
 3: repeat
 4:     for all unclustered candidate $c_i$ do
 5:         if can find similar unclustered character $c_j$ then
 6:             Update information of $c_i$ to incorporate $c_j$;
 7:             Label $c_j$ as clustered;
 8:         end if
 9:     end for
10: until no character can be merged


4.2 Clustering based Simulated Annealing

To deal with the limitations of the ILP formulation, a fast packing framework is proposed (see Fig. 9). Given the input character candidates, a pre-filter process is first applied to remove characters with poor profit (as defined earlier). The second step is a clustering algorithm that effectively speeds up the design process. Finally, a floorplanner packs all candidates.

Clustering is a well-studied problem, and there are many works and applications in VLSI. However, previous methodologies cannot be directly applied here. First, traditional clustering is based on a netlist, which provides all the clustering options. Generally speaking, a netlist is sparse, but in OSP the connection relationships are so complex that any two characters can be clustered, and in total there are $O(n^2)$ clustering options. Second, given two candidates $c_i$ and $c_j$, there are several clustering options. For example, horizontal clustering and vertical clustering may have different overlapping blank space.

The main idea of our clustering is to iteratively search for and group each character pair $(c_i, c_j)$ with similar blank spaces, profits, and sizes. Character $c_i$ is said to be similar to $c_j$ if the following condition is satisfied:

$$
\begin{cases}
\max\{|w_i - w_j|/w_j,\ |h_i - h_j|/h_j\} \le bound\\
\max\{|sh_i - sh_j|/sh_j,\ |sv_i - sv_j|/sv_j\} \le bound\\
|profit_i - profit_j|/profit_j \le bound
\end{cases}
\qquad (8)
$$

where $w_i$ and $h_i$ are the width and height of $c_i$, and $sh_i$ and $sv_i$ are the horizontal blank space and vertical blank space of $c_i$, respectively. In our implementation, bound is set to 0.2. We can see that the clustering considers sizes, blanks, and profits together.
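The similarity condition can be written as a small predicate. This is a minimal sketch: the dictionary keys (`w`, `h`, `sh`, `sv`, `profit`) and the function names are assumptions chosen to mirror the symbols in the condition, and all quantities of the second character are assumed to be positive so that the relative differences are defined.

```python
def rel_diff(a, b):
    """Relative difference |a - b| / b, with b assumed positive."""
    return abs(a - b) / b


def is_similar(ci, cj, bound=0.2):
    """Check the three-part similarity condition between characters ci and cj."""
    size = max(rel_diff(ci["w"], cj["w"]), rel_diff(ci["h"], cj["h"]))
    blank = max(rel_diff(ci["sh"], cj["sh"]), rel_diff(ci["sv"], cj["sv"]))
    gain = rel_diff(ci["profit"], cj["profit"])
    return size <= bound and blank <= bound and gain <= bound


c1 = {"w": 40, "h": 40, "sh": 5, "sv": 5, "profit": 100}
c2 = {"w": 42, "h": 40, "sh": 5, "sv": 5.5, "profit": 110}
c3 = {"w": 60, "h": 40, "sh": 5, "sv": 5, "profit": 100}
```

Here `c1` and `c2` differ by less than 20% in every quantity and are treated as similar, while `c1` and `c3` differ too much in width.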

The details of our clustering procedure are shown in Algorithm 4. First, all the initial character candidates are sorted by $profit_i$ (line 1), so that characters with more shot number reduction tend to be clustered. Then all characters are labeled as unclustered (line 2). The clustering (lines 3-10) is repeated until no characters can be further merged. When clustering $c_i$ and $c_j$, the information of $c_i$ is modified to incorporate $c_j$, and $c_j$ is labeled as clustered.

For each candidate $c_i$, finding an available $c_j$ may need $O(n)$ time, and the complexities of horizontal clustering and vertical clustering are both $O(n^2)$. The complexity of the whole procedure is therefore $O(n^2)$, where $n$ is the number of candidates.
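The clustering loop can be sketched with a plain linear scan in place of the KD-Tree lookup, which makes the quadratic cost of each pass visible. The sketch rests on several assumptions: candidates are sorted in decreasing order of profit, the similarity test is passed in as a hypothetical predicate `similar`, and only cluster membership is recorded (the update of the merged cluster's size and blanks is omitted).

```python
def cluster(candidates, similar):
    """Group candidates into clusters; returns a list of lists of indices.

    candidates: list of dicts with at least a 'profit' key.
    similar:    predicate deciding whether two candidates may be merged.
    """
    order = sorted(range(len(candidates)),
                   key=lambda i: candidates[i]["profit"], reverse=True)
    members = {i: [i] for i in order}      # representative -> merged members
    clustered = set()                      # candidates absorbed by another one
    merged = True
    while merged:                          # repeat ... until nothing merges
        merged = False
        for i in order:
            if i in clustered:
                continue
            for j in order:                # O(n) scan for a partner of c_i
                if j != i and j not in clustered and similar(candidates[i], candidates[j]):
                    members[i].extend(members.pop(j))
                    clustered.add(j)
                    merged = True
                    break                  # at most one merge per candidate and pass
    return list(members.values())


cands = [{"profit": 10, "w": 40}, {"profit": 9, "w": 41}, {"profit": 5, "w": 80}]
groups = cluster(cands, lambda a, b: abs(a["w"] - b["w"]) <= 2)
```

The inner scan over `order` is the step that a region query on a spatial index would replace.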


A KD-Tree [37] is used to speed up the process of finding an available pair $(c_i, c_j)$. It provides fast $O(\log n)$ region searching operations while keeping the time for insertion and deletion small: insertion, $O(\log n)$; deletion of the root, $O(n^{(k-1)/k})$, where $k$ is the number of dimensions; deletion of a random node, $O(\log n)$. Using the KD-Tree, the complexity of Algorithm 4 can be reduced to $O(n \log n)$. For instance, given nine character candidates $\{c_1, \dots, c_9\}$ as in Fig. 10(a), the corresponding KD-Tree is shown in Fig. 10(b). Note that a KD-Tree can store multi-dimensional vertices, so a single tree is enough to store all the information regarding width, height, blank spaces, and profits. For the sake of convenience, here the characters are distributed only by horizontal and vertical blank spaces, so only a two-dimensional space is illustrated in Fig. 10(a). To search for candidates with blank space similar to that of $c_2$ (see the shaded region of Fig. 10(a)), scanning all candidates may need $O(n)$ time, where $n$ is the total number of candidates. Under the KD-Tree structure, however, this search can be resolved in $O(\log n)$. All candidates scanned ($c_1$ to $c_5$) are illustrated in Fig. 10(b). In particular, after scanning $c_5$, since $c_5$ is out of the search range, we can be sure that the whole sub-tree rooted at $c_5$ is out of the search range as well.
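A region search of this kind can be sketched with a small static two-dimensional KD-Tree. This is an illustrative sketch only: the tree is built once from a list of hypothetical `(horizontal_blank, vertical_blank)` points, insertion and deletion are not implemented, and the names `build` and `range_search` are assumptions.

```python
def build(points, depth=0):
    """Build a 2D KD-Tree as nested (point, left, right) tuples, or None if empty."""
    if not points:
        return None
    axis = depth % 2                                  # alternate split dimension
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))


def range_search(node, lo, hi, depth=0, found=None):
    """Collect all points p with lo[d] <= p[d] <= hi[d] in both dimensions."""
    if found is None:
        found = []
    if node is None:
        return found
    point, left, right = node
    axis = depth % 2
    if all(lo[d] <= point[d] <= hi[d] for d in range(2)):
        found.append(point)
    if lo[axis] <= point[axis]:       # region can reach into the left sub-tree
        range_search(left, lo, hi, depth + 1, found)
    if point[axis] <= hi[axis]:       # region can reach into the right sub-tree
        range_search(right, lo, hi, depth + 1, found)
    return found


pts = [(1, 9), (2, 3), (4, 1), (3, 7), (5, 4), (6, 8), (7, 2), (8, 8), (7, 9)]
tree = build(pts)
hits = sorted(range_search(tree, (2, 2), (5, 7)))
```

Whenever the split coordinate of a node falls outside the query region on one side, the whole sub-tree on that side is skipped, which is the pruning effect described above.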


In [24], the 2DOSP is transformed into a fixed-outline floorplanning problem. If a character candidate falls outside the fixed outline, the character is not prepared on the stencil. Otherwise, the character candidate is selected and packed on the stencil. Parquet [38] was adopted as the simulated annealing engine, and Sequence Pair was used as the topology representation. In E-BLOW we apply a simulated annealing based framework similar to that in [24]. To demonstrate the effectiveness of our pre-filter and clustering methodologies, E-BLOW uses the same parameters.



















TABLE 3

Result Comparison for 1DOSP

| Case | char # | CP # | Greedy in [24]: T | char# | CPU(s) | [24]: T | char# | CPU(s) | [25]: T | char# | CPU(s) | E-BLOW: T | char# | CPU(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1D-1 | 1000 | 1 | 64891 | 912 | 0.1 | 50809 | 926 | 13.5 | 19095 | 940 | 0.005 | 19479 | 940 | 2.1 |
| 1D-2 | 1000 | 1 | 99381 | 884 | 0.1 | 93465 | 854 | 11.8 | 35295 | 864 | 0.005 | 34974 | 866 | 1.7 |
| 1D-3 | 1000 | 1 | 165480 | 748 | 0.1 | 152376 | 749 | 9.13 | 69301 | 757 | 0.005 | 67209 | 766 | 1.7 |
| 1D-4 | 1000 | 1 | 193881 | 691 | 0.1 | 193494 | 687 | 7.7 | 92523 | 703 | 0.005 | 93816 | 703 | 4.5 |
| 1M-1 | 1000 | 10 | 63811 | 912 | 0.1 | 53333 | 926 | 13.5 | 39026 | 938 | 0.01 | 37848 | 944 | 3.8 |
| 1M-2 | 1000 | 10 | 104877 | 884 | 0.1 | 95963 | 854 | 11.8 | 77997 | 864 | 0.01 | 75303 | 874 | 3.5 |
| 1M-3 | 1000 | 10 | 172834 | 748 | 0.1 | 156700 | 749 | 9.2 | 138256 | 758 | 0.56 | 132773 | 774 | 9.3 |
| 1M-4 | 1000 | 10 | 200498 | 691 | 0.1 | 196686 | 687 | 7.7 | 176228 | 698 | 0.36 | 173193 | 711 | 7.4 |
| 1M-5 | 4000 | 10 | 274992 | 3604 | 1.0 | 255208 | 3629 | 1477.3 | 204114 | 3660 | 0.03 | 202401 | 3680 | 37.9 |
| 1M-6 | 4000 | 10 | 437088 | 3341 | 1.0 | 417456 | 3346 | 1182 | 357829 | 3382 | 0.03 | 348007 | 3420 | 48.4 |
| 1M-7 | 4000 | 10 | 650419 | 3000 | 1.0 | 644288 | 2986 | 876 | 568339 | 3016 | 0.59 | 563054 | 3064 | 54.0 |
| 1M-8 | 4000 | 10 | 820013 | 2756 | 1.0 | 809721 | 2734 | 730.7 | 731483 | 2760 | 0.42 | 721149 | 2818 | 54.7 |
| Avg. | - | - | 270680.4 | 1597.6 | 0.4 | 259958.3 | 1594.0 | 362.5 | 209123.8 | 1611.7 | 0.17 | 205767.2 | 1630.7 | 16.6 |
| Ratio | - | - | 1.32 | 0.98 | 0.02 | 1.26 | 0.98 | 19.01 | 1.02 | 0.99 | 0.01 | 1.0 | 1.0 | 1.0 |



[Figure: panels (a) and (b); axis label: Horizontal Blank]

Fig. 10. KD-Tree based region searching. (a) A two dimensional space split by eight points; (b) The corresponding two dimensional KD-Tree.

5 Experimental Results

E-BLOW is implemented in C++ and executed on a Linux machine with two 3.0 GHz CPUs and 32 GB of memory. GUROBI is used to solve the ILP/LP. The benchmark suite from [24] is tested (1D-1, ..., 1D-4, 2D-1, ..., 2D-4). To evaluate the algorithms for the MCC system, eight benchmarks (1M-x) are generated for 1DOSP and another eight (2M-x) are generated for the 2DOSP problem. In these new benchmarks, the character projection (CP) number is set to 10. For each small case (1M-1, ..., 1M-4, 2M-1, ..., 2M-4) the character candidate number is 1000, and the stencil size is set to 1000 μm × 1000 μm. For each larger case (1M-5, ..., 1M-8, 2M-5, ..., 2M-8) the character candidate number is 4000, and the stencil size is set to 2000 μm × 2000 μm. The size and the blank width of each character are similar to those in [24]. Note that [24] targets a single-CP system; for the MCC system it is modified to optimize the total writing time of all the regions.

5.1 Comparison for 1DOSP

For 1DOSP, Table 3 compares E-BLOW with the greedy method in [24], the heuristic framework in [24], and the algorithms in [25]. We obtained the programs of [24] and executed them on our machine. The results of [25] are taken directly from their paper. Column "char #" is the number of character candidates, and column "CP #" is the number of character projections. For each algorithm, we report "T", "char#" and "CPU(s)", where "T" is the writing time of the E-Beam system, "char#" is the number of characters on the final stencil, and "CPU(s)" is the runtime. From Table 3 we can see that E-BLOW achieves better performance than both the greedy method and the heuristic method in [24]. Compared with E-BLOW, the greedy method has 32% more system writing time, while [24] introduces 27% more system writing time. One possible reason is that, unlike the greedy/heuristic methods, E-BLOW uses mathematical formulations to provide a global view. Additionally, due to the successive rounding scheme, E-BLOW is around 22× faster than the work in [24].



E-BLOW is further compared with one recent 1DOSP solver [25] in Table 3. E-BLOW found stencil placements with the best E-Beam system writing time for 10 out of 12 test cases. In addition, for all the MCC system cases (1M-1, ..., 1M-8) E-BLOW outperforms [25]. One possible reason is that, to optimize the overall throughput of the MCC system, a global view is necessary to balance the throughputs among different regions. E-BLOW uses mathematical formulations to provide such global optimization. Although linear programming solvers are more expensive than the deterministic heuristics in [25], the runtime of E-BLOW is reasonable: each case can be finished in 20 seconds on average.


We further demonstrate the effectiveness of the fast ILP convergence (Section 3.3) and post-insertion (Section 3.5). We denote by E-BLOW-0 the version of E-BLOW without these two techniques, and by E-BLOW-1 the version with them. Fig. 11 and Fig. 12 compare E-BLOW-0 and E-BLOW-1 in terms of system writing time and runtime, respectively. From Fig. 11 we can see that applying fast ILP convergence and post-insertion effectively improves the E-Beam system throughput: on average, a 9% system writing time reduction is achieved. In addition, Fig. 12 demonstrates the performance of the fast ILP convergence (see Section 3.3). We can see that in 11 out of 12 test cases, the fast ILP convergence effectively reduces the E-BLOW CPU time. A possible reason for the slowdown in case 1D-4 is that, when fast convergence is called while many $a_{ij}$ variables are still unsolved, the ILP solver may suffer from runtime overhead. However, if more successive rounding iterations are applied before ILP convergence, a lower runtime can be achieved.


Fig. 11. The comparison of E-Beam system writing times between E-BLOW-0 and E-BLOW-1.

Fig. 12. The comparison of runtime between E-BLOW-0 and E-BLOW-1.

5.2 Comparison for 2DOSP 

For 2DOSP, Table 4 gives a similar comparison. For each algorithm, we also record "T", "char #" and "CPU(s)", whose meanings are the same as in Table 3. Compared with E-BLOW, although the greedy algorithm is faster, its design results introduce 41% more system writing time. Furthermore, compared with E-BLOW, although the framework in [24] puts 2% more characters onto the stencil, it incurs 15% more system writing time. A possible reason is that in E-BLOW the characters with similar writing time are clustered together. The clustering method also helps to speed up the packing, so E-BLOW is 28× faster than [24]. In addition, clustering reduces the number of characters. With a smaller solution space, it is easier for the simulated annealing engine to reach a better solution in terms of system writing time.

From both tables we can see that, compared with [24], E-BLOW achieves a better tradeoff between runtime and system throughput.

5.3 E-BLOW vs. ILP 

We further compare E-BLOW with the ILP formulations of 1DOSP and 2DOSP. Although the ILP formulations can theoretically find optimal solutions for both OSP problems, they may suffer from runtime overhead. Therefore, we randomly generate nine small benchmarks, five for 1DOSP ("1T-x") and four for 2DOSP ("2T-x"). The sizes of all the character candidates are set to 40 μm × 40 μm. For the 1DOSP benchmarks, the row number is set to 1 and the row length is set to 200. The comparisons are listed in Table 5, where column "candidate#" is the number of character candidates. "ILP" and "E-BLOW" represent the ILP formulation and our E-BLOW framework, respectively. For the ILP formulation, column "binary#" gives the number of binary variables. For each mode, we report "T", "char#" and "CPU(s)", where "T" is the E-Beam system writing time, "char#" is the number of characters on the final stencil, and "CPU(s)" is the runtime. Note that in Table 5 the ILP solutions are optimal.

Let us compare E-BLOW with the ILP formulation for the 1D cases (1T-1, ..., 1T-5). E-BLOW achieves the same results as the ILP formulation, and it is very fast: all cases finish within 0.2 seconds. Although the ILP formulation achieves optimal results, it is so slow that a case with 14 character candidates (1T-5) cannot be solved in one hour. Next, let us compare E-BLOW with the ILP formulation for the 2D cases (2T-1, ..., 2T-4). For the 2D cases the ILP formulation is so slow that a case with 12 character candidates cannot finish in one hour. E-BLOW is fast, but with some solution quality penalty.

Although the number of integer variables in each case is not huge, we find that in the ILP formulations the solutions of the corresponding LP relaxations are vague. Therefore, an expensive search may cause unacceptable runtimes. These cases show that the ILP formulations cannot be directly applied to the OSP problem, since in an MCC system the character number may be as large as 4000.

6 Conclusion 

In this paper, we have proposed E-BLOW, a tool to solve the OSP problem in the MCC system. For 1DOSP, a successive relaxation algorithm and a dynamic programming based refinement are proposed. For 2DOSP, a KD-Tree based clustering method is integrated into a simulated annealing framework. Experimental results show that, compared with previous works, E-BLOW achieves better performance in terms of shot number and runtime, for both the MCC system and the traditional EBL system. Note that the extra cost for multiple stencils is mostly the cost of multiple stencil design, thus different regions tend to have


















































































TABLE 4

Result Comparison for 2DOSP

| Case | char # | CP # | Greedy in [24]: T | char # | CPU(s) | [24]: T | char # | CPU(s) | E-BLOW: T | char # | CPU(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2D-1 | 1000 | 1 | 159654 | 734 | 2.1 | 107876 | 826 | 329.6 | 105723 | 789 | 65.5 |
| 2D-2 | 1000 | 1 | 269940 | 576 | 2.4 | 166524 | 741 | 278.1 | 170934 | 657 | 52.5 |
| 2D-3 | 1000 | 1 | 290068 | 551 | 2.6 | 210496 | 686 | 296.7 | 178777 | 663 | 56.4 |
| 2D-4 | 1000 | 1 | 327890 | 499 | 2.7 | 240971 | 632 | 301.7 | 179981 | 605 | 54.7 |
| 2M-1 | 1000 | 1 | 168279 | 734 | 2.1 | 122017 | 811 | 313.7 | 91193 | 777 | 58.6 |
| 2M-2 | 1000 | 1 | 283702 | 576 | 2.4 | 187235 | 728 | 286.1 | 163327 | 661 | 48.7 |
| 2M-3 | 1000 | 1 | 298813 | 551 | 2.6 | 235788 | 653 | 289.0 | 162648 | 659 | 52.3 |
| 2M-4 | 1000 | 1 | 338610 | 499 | 2.7 | 270384 | 605 | 285.6 | 195469 | 590 | 53.3 |
| 2M-5 | 4000 | 10 | 824060 | 2704 | 19.0 | 700414 | 2913 | 3891.0 | 687287 | 2853 | 59.0 |
| 2M-6 | 4000 | 10 | 1044161 | 2388 | 20.2 | 898530 | 2624 | 4245.0 | 717236 | 2721 | 60.7 |
| 2M-7 | 4000 | 10 | 1264748 | 2101 | 21.9 | 1064789 | 2410 | 3925.5 | 921867 | 2409 | 57.1 |
| 2M-8 | 4000 | 10 | 1331457 | 2011 | 22.8 | 1176700 | 2259 | 4550.0 | 1104724 | 2119 | 57.7 |
| Avg. | - | - | 550115 | 1218.1 | 8.3 | 448477 | 1324 | 1582.7 | 389930.5 | 1291.9 | 56.375 |
| Ratio | - | - | 1.41 | 0.94 | 0.15 | 1.15 | 1.02 | 28.1 | 1.0 | 1.0 | 1.0 |


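TABLE 5

ILP vs. E-BLOW

| Case | candidate# | ILP: binary# | T | char# | CPU(s) | E-BLOW: T | char# | CPU(s) |
|---|---|---|---|---|---|---|---|---|
| 1T-1 | 8 | 64 | 434 | 6 | 0.5 | 434 | 6 | 0.1 |
| 1T-2 | 10 | 100 | 1034 | 6 | 26.1 | 1034 | 6 | 0.2 |
| 1T-3 | 11 | 121 | 1222 | 6 | 58.3 | 1222 | 6 | 0.2 |
| 1T-4 | 12 | 144 | 1862 | 6 | 1510.4 | 1862 | 6 | 0.2 |
| 1T-5 | 14 | 196 | NA | NA | >3600 | 2758 | 6 | 0.1 |
| 2T-1 | 6 | 66 | 60 | 6 | 37.3 | 207 | 5 | 0.1 |
| 2T-2 | 8 | 120 | 354 | 6 | 40.2 | 653 | 7 | 0.1 |
| 2T-3 | 10 | 190 | 1050 | 6 | 436.8 | 4057 | 4 | 0.1 |
| 2T-4 | 12 | 276 | NA | NA | >3600 | 4208 | 5 | 0.2 |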


specific stencils to improve the throughput. However, if a shared stencil is designed and optimized well enough that sharing achieves comparable throughput, the stencil design cost can even be reduced. In that situation, a shared stencil design could be attractive, especially for companies with a limited design budget. As EBL, including the MCC system, is widely used for mask making and is also gaining momentum for direct wafer writing, we believe much more research can be done not only on stencil planning but also on EBL-aware design.

Appendix

PROOF OF THEOREM 1

Lemma 5. BSS problem is in NP.

Proof: It is easy to see that the BSS problem is in NP. Given a subset of integer numbers $S' \subseteq S$, we can add them up and verify that their sum is $s$ in polynomial time. □

Lemma 6. 3SAT $\le_p$ BSS.

Proof: In the 3SAT problem, we are given $m$ clauses $\{C_1, C_2, \dots, C_m\}$ over $n$ variables $\{y_1, y_2, \dots, y_n\}$. Each clause is the OR of three literals. Eqn. (9) gives one example of 3SAT, where $n = 4$ and $m = 2$.

$$(y_1 \vee \bar{y}_3 \vee \bar{y}_4) \wedge (\bar{y}_1 \vee y_2 \vee \bar{y}_4) \qquad (9)$$

Without loss of generality, we can have the following 
assumptions: 



1) No clause contains both $y_i$ and $\bar{y}_i$. Otherwise, any such clause is always true and we can eliminate it from the formula.

2) Each variable $y_i$ appears in at least one clause. Otherwise, we can assign an arbitrary value to the variable $y_i$.

To convert a 3SAT instance to a BSS instance, we create two integer numbers in set $S$ for each variable $y_i$ and three integer numbers in $S$ for each clause $C_j$. All the numbers in set $S$ and $s$ are in base 10. Besides, every number in $S$ lies strictly between $10^{n+2m}$ and $2 \cdot 10^{n+2m}$, so that the bounded constraints are satisfied. All the details regarding $S$ and $s$ are defined as follows.


• In the set $S$, all integer numbers have $n + 2m + 1$ digits, and the first digit is always 1.

• In the set $S$, we construct two integer numbers $t_i$ and $f_i$ for each variable $y_i$. For both values, the $n$ digits after the first '1' indicate the corresponding variable: the $i$th of these $n$ digits is set to 1 and all others are 0. In the next $m$ digits, the $j$th digit is set to 1 if clause $C_j$ contains the respective literal ($y_i$ for $t_i$, $\bar{y}_i$ for $f_i$). The last $m$ digits are always 0.

• In the set $S$, we also construct three integer numbers $c_{j1}$, $c_{j2}$ and $c_{j3}$ for each clause $C_j$. In $c_{jk}$, where $k \in \{1, 2, 3\}$, the first $n$ digits after the first '1' are 0, and the next $m$ digits are all 0 except the $j$th one, which is set to $k$. The last $m$ digits are all 0 except the $j$th one, which is set to 1.

• $s = (n + m) \cdot 10^{n+2m} + s_0$, where $s_0$ is an integer number with $n + 2m$ digits. The first $n$ digits of $s_0$ are 1, the next $m$ digits are all 4, and the last $m$ digits are all 1.

Based on the above rules, given the 3SAT instance in Eqn. (9), the constructed set $S$ and target $s$ are shown in Fig. 13. Note that the highest digit achievable is 9, meaning that no digit will carry over and interfere with other digits.
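The construction can be checked mechanically. The sketch below builds the numbers for a 3SAT instance following the digit layout described above; the function name `build_bss` and the encoding of literals as signed integers (`+i` for $y_i$, `-i` for its negation) are assumptions made for the example.

```python
def build_bss(n, clauses):
    """Build the BSS numbers for a 3SAT instance with n variables.

    clauses: list of 3-literal tuples; +i stands for y_i, -i for NOT y_i.
    Returns (t, f, c, s): dicts t[i], f[i], c[(j, k)] and the target s.
    """
    m = len(clauses)
    lead = 10 ** (n + 2 * m)                 # the leading digit '1'

    def var_digit(i):                        # i-th of the n variable digits
        return 10 ** (n + 2 * m - i)

    def clause_digit(j):                     # j-th of the next m digits
        return 10 ** (2 * m - j)

    def tail_digit(j):                       # j-th of the last m digits
        return 10 ** (m - j)

    t, f = {}, {}
    for i in range(1, n + 1):
        t[i] = lead + var_digit(i) + sum(
            clause_digit(j + 1) for j, cl in enumerate(clauses) if i in cl)
        f[i] = lead + var_digit(i) + sum(
            clause_digit(j + 1) for j, cl in enumerate(clauses) if -i in cl)
    c = {(j, k): lead + k * clause_digit(j) + tail_digit(j)
         for j in range(1, m + 1) for k in (1, 2, 3)}
    s0 = (sum(var_digit(i) for i in range(1, n + 1))
          + sum(4 * clause_digit(j) + tail_digit(j) for j in range(1, m + 1)))
    return t, f, c, (n + m) * lead + s0


# The example instance with n = 4 variables and m = 2 clauses.
t, f, c, s = build_bss(4, [(1, -3, -4), (-1, 2, -4)])
subset = [f[1], t[2], f[3], f[4], c[(1, 2)], c[(2, 1)]]
```

For the assignment $y_1 = 0$, $y_2 = 1$, $y_3 = 0$, $y_4 = 0$, the six numbers in `subset` are the ones selected by the reduction, and their sum can be compared against `s`.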


Claim 1. The 3SAT instance has a satisfying truth assignment iff the constructed BSS instance has a subset that adds up to $s$.


Proof of the $\Rightarrow$ part of the Claim: If the 3SAT instance has a satisfying assignment, we can pick a subset containing all $t_i$ for which $y_i$ is set to true and all $f_i$ for which $y_i$ is set to false. We can then reach $s$ by picking the necessary $c_{jk}$ to get the 4's in $s$. Due to the last $m$ '1's in $s$, for each $j \in [m]$ only one number is selected from $\{c_{j1}, c_{j2}, c_{j3}\}$. Besides, we can see that in total $n + m$ numbers are selected from $S$.

Proof of the $\Leftarrow$ part of the Claim: If there is a subset $S' \subseteq S$ that adds up to $s$, we show that it corresponds to a satisfying assignment of the 3SAT instance. $S'$ must include exactly one of $t_i$ and $f_i$, otherwise the $i$th digit of $s_0$ cannot be satisfied. If $t_i \in S'$, we set $y_i$ to true in the 3SAT instance; otherwise we set it to false. Similarly, $S'$ must include exactly one of $c_{j1}$, $c_{j2}$ and $c_{j3}$, otherwise the last $m$ digits of $s$ cannot be satisfied. Since the selected $c_{jk}$ contributes at most 3 to the clause digit of $C_j$, whose target value is 4, at least one selected $t_i$ or $f_i$ must contribute to that digit, i.e., at least one literal of $C_j$ is true. Therefore, all clauses in the 3SAT instance are satisfied and 3SAT has a satisfying assignment.

□

Fig. 13. The constructed BSS instance for the given 3SAT instance in Eqn. (9).

For instance, given a satisfying assignment of Eqn. (9), $(y_1 = 0, y_2 = 1, y_3 = 0, y_4 = 0)$, the corresponding subset $S'$ is $\{f_1 = 110000100, t_2 = 101000100, f_3 = 100101000, f_4 = 100011100, c_{12} = 100002010, c_{21} = 100000101\}$. We set $s = (m + n) \cdot 10^{n+2m} + s_0$, where $s_0 = 11114411$, and then $s = 611114411$. We can see that $f_1 + t_2 + f_3 + f_4 + c_{12} + c_{21} = s$.

Combining Lemma 5 and Lemma 6, we can achieve the following theorem.





